In [ ]:
%%HTML
<style>
.container { width:100% }
</style>
In [ ]:
import pandas as pd
import numpy as np
In [ ]:
IrisDF = pd.read_csv('iris.csv')
IrisDF.head()
We extract the set of all species occurring in the DataFrame by converting the column 'species' into a set, and then turn this set back into a list.
In [ ]:
Species = list(set(IrisDF['species']))
Species
We extract the feature names. This can be done conveniently because converting a DataFrame into a list yields its column names. However, we do not need the column 'species',
since this is the dependent variable. Fortunately, it is the last element of the list, so we can easily drop it.
In [ ]:
Features = list(IrisDF)[:-1]
Features
scikit-learn provides a classifier based on the naive Bayes algorithm that assumes that the continuous variables have a Gaussian distribution.
In [ ]:
from sklearn.naive_bayes import GaussianNB
We extract the independent variables and store them in the design matrix X.
In [ ]:
X = IrisDF[Features]
We extract the dependent variable and store it in Y.
In [ ]:
Y = IrisDF['species']
We construct a naive Bayes classifier that assumes a normal distribution. This classifier assumes that $$ P(f=x | C) = \frac{1}{\sqrt{2\cdot\pi\;}\cdot \sigma_{f,C}} \cdot \exp\left(-\frac{\bigl(x-\mu_{f,C}\bigr)^2}{2 \cdot \sigma_{f,C}^2}\right). $$ Here $ P(f=x | C)$ is the conditional probability density that the feature $f$ has the value $x$ given that $C$ is the species of the flower under investigation. $\mu_{f,C}$ is the mean value of the feature $f$ for the class $C$, while $\sigma_{f,C}^2$ is the variance of the feature $f$ for the class $C$.
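The density formula above is easy to check directly. The following sketch implements it in NumPy and compares it against `scipy.stats.norm.pdf`; the concrete values for the mean and standard deviation are made-up illustrative numbers, not values computed from the iris data.

```python
import numpy as np
from scipy.stats import norm

def gaussian_density(x, mu, sigma):
    """Conditional density P(f = x | C), assuming the feature f is
    normally distributed with class-conditional mean mu and
    standard deviation sigma."""
    return 1.0 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Hypothetical example: density of a sepal length of 5.0 for a class
# with mean 5.006 and standard deviation 0.349.
value = gaussian_density(5.0, 5.006, 0.349)

# The result agrees with SciPy's reference implementation.
print(value, norm.pdf(5.0, loc=5.006, scale=0.349))
```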
In [ ]:
classifier = GaussianNB()
We train the classifier with our data.
In [ ]:
classifier.fit(X, Y)
We compare the predicted values of our classifier with the actual values and compute the accuracy.
In [ ]:
np.sum(classifier.predict(X) == Y) / len(Y)
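The comparison above measures accuracy on the same data the classifier was trained on, which tends to be optimistic. A common alternative, sketched here with scikit-learn's bundled iris data (again standing in for `iris.csv`), is to hold out a test set and use the built-in `score` method, which computes the same accuracy as comparing `predict`'s output by hand.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
# Hold out 30% of the flowers so accuracy is measured on unseen data.
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, Y_train)
print(clf.score(X_test, Y_test))  # accuracy on the held-out test set
```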